Speeding Up Relational Data Mining by Learning to Estimate Candidate Hypothesis Scores

Authors

  • Frank DiMaio
  • Jude Shavlik
Abstract

The motivation behind multi-relational data mining is knowledge discovery in relational databases containing multiple related tables. One difficulty relational data mining faces is managing intractably large hypothesis spaces. We attempt to overcome this difficulty by first sampling the hypothesis space. We generate a small set of hypotheses, uniformly sampled from the space of candidate hypotheses, and evaluate this set on actual data. These hypotheses and their corresponding evaluation scores serve as training data in learning an approximate hypothesis evaluator. We use this approximate evaluation to quickly rate potential hypotheses without needing to score them on actual data. We test our approximate clause evaluation algorithm using the popular Inductive Logic Programming (ILP) system Aleph. We use a neural network to approximate the hypothesis-evaluation function. The trained neural network replaces Aleph's hypothesis evaluation on actual data, scoring potential rules in time independent of the number of examples. Our approximate evaluator can be used in a heuristic search to help escape local maxima. We test the neural network's ability to learn the hypothesis-evaluation function on four benchmark ILP domains; the neural network is able to accurately approximate the hypothesis-evaluation function.

Introduction and Background

Most data mining techniques assume that the data exists in a form that can be easily converted into a set of fixed-length feature vectors (where each example is converted into a fixed-size array of real numbers, integers, or nominal attributes). For many multi-relational datasets, such a conversion – when even possible – is inelegant and scales poorly. Conversely, Inductive Logic Programming (ILP) [1] natively handles multi-relational data. ILP's natural treatment of multi-relational datasets avoids the problems associated with converting examples into feature vectors. As a further advantage, its rules have the full expressive power of first-order logic, making for rich and human-readable hypotheses.

ILP systems have proven quite successful in constructing sets of accurate rules, even on datasets with many relations. Such systems have been successfully employed in a number of varied domains, including molecular biology, engineering design, natural language processing, and software analysis.

ILP systems combine background domain knowledge and categorized training data in constructing a set of rules (hypotheses) in first-order logic. Formally, given a training set of positive examples E⁺, negative examples E⁻, and background knowledge B, all as sets of clauses in first-order logic, ILP's goal is finding a hypothesis h (a set of clauses in first-order logic) such that

B ∪ h ⇒ E⁺ and B ∪ h ⇏ E⁻ (1)

That is, given the background knowledge and the hypothesis, one can deduce all of the positive examples and none of the negative examples. In real-world applications, these constraints are usually relaxed somewhat, allowing h to explain most positive examples and few negative examples.

The algorithm underlying most ILP systems is basically the same: it searches for a clause in the subsumption lattice [2], evaluating candidate clauses on the training data. The search begins with an initial candidate clause and treats hypothesis generation as a local search problem in the subsumption lattice. The starting point for the search and the type of local search depend on the implementation of the ILP system. The subsumption lattice is constructed based on the idea of specificity of clauses. Specificity here refers to implication; a clause C is more specific than a clause S if S ⇒ C.
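For instance (a minimal sketch in Prolog notation; the predicates here are illustrative placeholders, not from the paper's benchmark domains), adding a literal to a clause's body yields a more specific clause:

    % S: every object satisfying q/1 is a p.
    p(X) :- q(X).

    % C: the extra literal r(X) makes the clause apply to fewer objects.
    % S theta-subsumes C (S's body literals are a subset of C's under the
    % identity substitution), hence S => C, and C lies below S in the lattice.
    p(X) :- q(X), r(X).

Here S covers every example that C covers, and possibly more; this generality ordering is exactly what the subsumption lattice encodes.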
In general, it is undecidable whether or not one clause in first-order logic implies another [3], so ILP systems use the weaker notion of Plotkin's θ-subsumption. Subsumption implies implication, but implication does not imply subsumption. Subsumption of candidate clauses puts a partial ordering on all clauses in hypothesis space. With this partial ordering, a lattice of clauses can be built. ILP implementations perform some type of local search over this lattice when considering candidate hypotheses.

Most ILP implementations also use a standard greedy covering algorithm. After completing a local search of the subsumption lattice, the best rule evaluated is accepted, and all the positive examples covered (explained) by the rule are removed from the dataset. The process is repeated until every positive example is covered.

The major distinction separating various ILP implementations is the strategy used in exploring the subsumption lattice. Algorithms fall into two main categories (with some exceptions): general-to-specific ("top-down") [4] and specific-to-general ("bottom-up") [5] enumeration of the subsumption lattice. Within this framework, a variety of common local search strategies have been employed, including breadth-first search [6], depth-first search, heuristic-guided hill-climbing variants [5,6], uniform random sampling [7], and rapid random restarts [8]. Our work provides a general framework for increasing the speed of any ILP algorithm, regardless of the order in which candidate clauses are evaluated.

One complaint levied against ILP systems is that they scale poorly to large datasets. Srinivasan [7] investigated the performance of ILP algorithms in general and found that the worst-case running time depends on both the size of the subsumption lattice and the time required for clause evaluation. The first factor – the search-space size – depends on the maximum allowed clause length and the number of terms in an example's saturation.

The idea of saturation is used by a number of ILP systems to put a bound on the size of the subsumption lattice. Saturation involves first choosing a positive example from the training set. Using the background knowledge, saturation constructs the most specific, fully-ground clause that entails the chosen example. It is constructed by applying all possible substitutions for variables in B with ground terms in B. This clause is called the chosen example's bottom clause, and it serves as the bottom element (⊥) in the subsumption lattice over which ILP searches. That is, all clauses considered by ILP (in the subsumption lattice) subsume (and thus imply) the saturated example. As a simple example, suppose we are given background knowledge (using Prolog notation, where ground atoms are denoted with an initial lowercase letter and variables are denoted with an initial uppercase letter):
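One minimal family-relations sketch (the specific predicates are illustrative, chosen only to keep the example small):

    % Background knowledge B: ground facts about parenthood.
    parent(ann, bob).
    parent(bob, carl).

    % Chosen positive example e:
    %   grandparent(ann, carl).

    % Saturation applies every possible substitution of ground terms from B,
    % yielding e's bottom clause -- the most specific clause entailing e:
    %   grandparent(ann, carl) :- parent(ann, bob), parent(bob, carl).

    % Clauses in the subsumption lattice generalize the bottom clause by
    % variablizing terms and dropping literals; a typical generalization is:
    grandparent(X, Z) :- parent(X, Y), parent(Y, Z).

Every clause ILP considers during its search subsumes this bottom clause, which is what bounds the lattice from below.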





Publication date: 2003